Human genetic clustering

Human genetic clustering analysis uses mathematical cluster analysis of the degree of similarity of genetic data between individuals and groups to infer population structures and assign individuals to groups that often correspond with their self-identified geographical ancestry. A similar analysis can be done using principal components analysis, which was a popular method in earlier research.^[1] Many recent studies have returned to using principal components analysis.


Studies

Phylogenetic tree by Cavalli-Sforza & al. (1994)

Neighbor-joining method, by Naruya Saitou and Masatoshi Nei (2002)

Clusters by Rosenberg & al. (2006)

Main article: Race and genetics

In 2004, Lynn Jorde and Stephen Wooding argued that "Analysis of many loci now yields reasonably accurate estimates of genetic similarity among individuals, rather than populations. Clustering of individuals is correlated with geographic origin or ancestry."^[3]

A study by Neil Risch in 2005 used 326 microsatellite markers and self-identified race/ethnic group (SIRE) to represent discrete "populations". Individuals involved in the study had to choose one of four categories: white (European American), African-American (black), Asian or Hispanic. The study showed distinct and non-overlapping clustering of the white, African-American and Asian samples. The results were claimed to confirm the integrity of self-described ancestry: "We have shown a nearly perfect correspondence between genetic cluster and SIRE for major ethnic groups living in the United States, with a discrepancy rate of only 0.14%." (Tang, 2005)

Studies such as those by Risch and Rosenberg use a computer program called STRUCTURE to find human populations (gene clusters). It is a statistical program that works by placing individuals into one of two clusters based on their overall genetic similarity; many possible pairs of clusters are tested per individual to generate multiple clusters.^[4] These populations are based on multiple genetic markers that are often shared between different human populations, even over large geographic ranges. The notion of a genetic cluster is that people within the cluster have, on average, allele frequencies more similar to each other than to those of people in other clusters. (A. W. F. Edwards, 2003, but see also infobox "Multi-Locus Allele Clusters") In a test of idealised populations, STRUCTURE was found to consistently under-estimate the number of populations in the data set when high migration rates between populations and slow mutation rates (such as those of single-nucleotide polymorphisms) were considered.^[5]

Nevertheless, the Rosenberg et al. (2002) paper shows that individuals can be assigned to specific clusters with a high degree of accuracy. One of the underlying questions regarding the distribution of human genetic diversity is the degree to which genes are shared between the observed clusters. It has been observed repeatedly that the majority of variation in the global human population is found within populations. This variation is usually calculated using Sewall Wright's fixation index (F_ST), which estimates how much of the total variation lies between groups rather than within them. The figures differ slightly depending on the type of gene studied, but in general it is common to claim that ~85% of genetic variation is found within groups, ~6–10% between groups within the same continent and ~6–10% between continental groups. For example, the Human Genome Project states "two random individuals from any one group are almost as different [genetically] as any two random individuals from the entire world."^[6] Sarich and Miele, however, have argued that estimates of genetic difference between individuals of different populations fail to take into account human diploidy.

The point is that we are diploid organisms, getting one set of chromosomes from one parent and a second from the other. To the extent that your mother and father are not especially closely related, then, those two sets of chromosomes will come close to being a random sample of the chromosomes in your population. And the sets present in some randomly chosen member of yours will also be about as different from your two sets as they are from one another. So how much of the variability will be distributed where? First is the 15 percent that is interpopulational. The other 85 percent will then split half and half (42.5 percent) between the intra- and interindividual within-population comparisons. The increase in variability in between-population comparisons is thus 15 percent against the 42.5 percent that is between-individual within-population. Thus, 15/42.5 is 32.5 percent, a much more impressive and, more important, more legitimate value than 15 percent.^[7]
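The apportionment figures discussed above (roughly 85% within groups, 15% between) come from fixation-index calculations. As a minimal sketch only, the following computes F_ST for a single two-allele locus using the heterozygosity-based form F_ST = (H_T − H_S) / H_T, assuming equally sized populations. The function names and the example frequencies of 0.7 and 0.3 are illustrative assumptions, not values taken from any of the cited studies, and published estimates use more elaborate estimators over many loci.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1 - p) at a two-allele locus."""
    return 2 * p * (1 - p)


def fst_single_locus(freqs):
    """F_ST = (H_T - H_S) / H_T for equally weighted populations.

    freqs holds the frequency of one allele in each population
    (hypothetical input; assumes a single biallelic locus).
    """
    mean_p = sum(freqs) / len(freqs)
    h_total = expected_heterozygosity(mean_p)  # pooled population
    h_within = sum(expected_heterozygosity(p) for p in freqs) / len(freqs)
    if h_total == 0:
        return 0.0  # locus is fixed everywhere, so there is no variation to apportion
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    print(fst_single_locus([0.7, 0.3]))
```

With frequencies of 0.7 and 0.3 the result is about 0.16: in this toy case roughly 16% of the variation lies between the two populations and the remaining 84% within them.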

Additionally, Edwards (2003) claims in his essay "Lewontin's Fallacy" that "It is not true, as Nature claimed, that 'two random individuals from any one group are almost as different as any two random individuals from the entire world'", and Risch et al. (2002) state "Two Caucasians are more similar to each other genetically than a Caucasian and an Asian." These statements are not the same. Risch et al. simply state that two indigenous individuals from the same geographical region are more similar to each other than either is to an indigenous individual from a different geographical region, a claim few would argue with. Jorde et al. put it like this:

The picture that begins to emerge from this and other analyses of human genetic variation is that variation tends to be geographically structured, such that most individuals from the same geographic region will be more similar to one another than to individuals from a distant region.^[3]

Edwards, by contrast, claims that it is not true that the differences between individuals from different geographical regions represent only a small proportion of the variation within the human population; that is, he claims that within-group differences between individuals are not almost as large as between-group differences. Bamshad et al. (2004) used the data from Rosenberg et al. (2002) to investigate the extent of genetic differences between individuals within continental groups relative to genetic differences between individuals from different continental groups. They found that, although these individuals could be classified very accurately into continental clusters, there was a significant degree of genetic overlap at the individual level: using 377 loci, individual Europeans were about 38% of the time more genetically similar to East Asians than to other Europeans.

Infobox

Multi-Locus Allele Clusters

In a haploid population, when a single locus with two alleles, + and −, is considered (blue), we can see a differential geographical distribution between Population I (70% +) and Population II (30% +).

When we want to assign an individual to one of these populations using this single locus, we will assign any + to Population I, because the probability (p) of this allele belonging to Population I is p = 0.7. The probability (q) of incorrectly assigning this allele to Population I is q = 1 − p, or 0.3. This amounts to a Bernoulli trial, because the answer to the question "is this the correct population?" is a simple yes or no. The test is therefore binomially distributed, but with a single trial.

But when three loci per individual are taken into account, each with p = 0.7 for a + allele in Population I, the average number of + alleles per individual becomes kp = 2.1 in Population I (number of trials (k = 3) × probability for each allele (p = 0.7)) and 0.9 (3 × 0.3) in Population II. This is sometimes referred to as the population trait value. Because alleles are discrete entities, we can only assign an individual to a population based on the number of whole + alleles it contains. Therefore we will assign any individual with two or three + alleles to Population I, and any individual with one or zero + alleles to Population II.

The binomial distribution with three trials and a probability of 0.7 shows that the probability of an individual from Population I having a single + allele is 0.189, and for zero + alleles it is 0.027. This gives a misclassification rate of 0.189 + 0.027 = 0.216, which is a smaller chance of misclassification than the 0.3 obtained for a single locus. Misclassification becomes much smaller as we use more loci. Each new locus adds an extra independent test to the binomial distribution, decreasing the chance of misclassification.
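The calculation above can be reproduced and extended to more loci with a short script. This is a minimal sketch under the infobox's own simplifying assumptions (haploid individuals, independent loci, the same + allele frequency of 0.7 at every locus in Population I); the function names are illustrative, and the assignment rule is a simple majority of + alleles, so only odd numbers of loci are used to avoid ties.

```python
from math import comb


def binomial_pmf(successes, trials, p):
    """Probability of exactly `successes` '+' alleles among `trials` loci."""
    return comb(trials, successes) * p**successes * (1 - p) ** (trials - successes)


def misclassification_rate(n_loci, p_plus=0.7):
    """Chance that a Population I individual is assigned to Population II.

    Rule assumed here: assign to Population I only if more than half of
    the loci carry '+'. n_loci should be odd so that ties cannot occur.
    """
    max_plus_for_pop_two = n_loci // 2  # this many '+' alleles or fewer -> Population II
    return sum(
        binomial_pmf(k, n_loci, p_plus) for k in range(max_plus_for_pop_two + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 11, 51):
        print(n, misclassification_rate(n))
```

For one locus the function returns 0.3 and for three loci 0.216, matching the figures in the text; larger odd numbers of loci give progressively smaller rates.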

Using modern computer software and the abundance of genetic data now available, it is possible not only to distinguish such correlations for hundreds or even thousands of alleles, which form clusters, but also to assign individuals to given populations with very little chance of error. However, genes tend to vary clinally, and there are likely to be intermediate populations that reside in the geographical areas between our sample populations (Population III, for example, may lie equidistant from Population I and Population II). In this case Population III may well display characteristics of both Population I and Population II and have intermediate frequencies for many of the alleles used for classification, making this population more prone to misclassification.

The existence of allelic clines and the observation that the bulk of human variation is continuously distributed have led some scientists to conclude that any categorization schema attempting to partition that variation meaningfully will necessarily create artificial truncations (Kittles & Weiss 2003). It is for this reason, Reanne Frank argues, that attempts to allocate individuals into ancestry groupings based on genetic information have yielded varying results that are highly dependent on methodological design.^[8] Serre and Pääbo (2004) make a similar claim:

The absence of strong continental clustering in the human gene pool is of practical importance. It has recently been claimed that “the greatest genetic structure that exists in the human population occurs at the racial level” (Risch et al. 2002). Our results show that this is not the case, and we see no reason to assume that “races” represent any units of relevance for understanding human genetic history.

In a response to Serre and Pääbo (2004), Rosenberg et al. (2005) make three relevant observations. Firstly, they maintain that their clustering analysis is robust. Secondly, they agree with Serre and Pääbo that membership of multiple clusters can be interpreted as evidence for clinality (isolation by distance), though they comment that this may also be due to admixture between neighbouring groups (small island model). Thirdly, they comment that evidence of clusteredness is not evidence for any concepts of "biological race".^[9]

Risch et al. (2002) state that "two Caucasians are more similar to each other genetically than a Caucasian and an Asian", but Bamshad et al. (2004)^[10] used the same data set as Rosenberg et al. (2002) to show that, when only 377 microsatellite markers are analysed, Europeans are more similar to Asians than to other Europeans 38% of the time.

Percentage of the time that two individuals from different clusters are genetically more similar to each other than to someone from their own population, when 377 microsatellite markers are considered.^[11]
Population	Africans	Europeans	Asians
Europeans	36.5	—	—
Asians	35.5	38.3	—
Indigenous Americans	26.1	33.4	35

In agreement with the observation of Bamshad et al. (2004), Witherspoon et al. (2007) have shown that many more than 326 or 377 microsatellite loci are required in order to show that individuals are always more similar to individuals in their own population group than to individuals in different population groups, even for three distinct populations.^[6]

Witherspoon et al. (2007) have argued that even when individuals can be reliably assigned to specific population groups, it may still be possible for two randomly chosen individuals from different populations/clusters to be more similar to each other than to a randomly chosen member of their own cluster. They found that many thousands of genetic markers had to be used in order for the answer to the question "How often is a pair of individuals from one population genetically more dissimilar than two individuals chosen from two different populations?" to be "never". This assumed three population groups separated by large geographic ranges (European, African and East Asian). The entire world population is much more complex and studying an increasing number of groups would require an increasing number of markers for the same answer. Witherspoon et al. conclude that "caution should be used when using geographic or genetic ancestry to make inferences about individual phenotypes."
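The question posed by Witherspoon et al. can be explored with a toy simulation. The sketch below is not their method or data: it assumes haploid individuals, independent loci, and two hypothetical populations whose + allele frequencies are 0.7 and 0.3 at every locus (the values used in the infobox). It estimates how often a within-population pair differs at more loci than a between-population pair; all function names are illustrative.

```python
import random


def sample_haploid(freq_plus, n_loci, rng):
    """Draw one haploid genotype; True marks a '+' allele at a locus."""
    return [rng.random() < freq_plus for _ in range(n_loci)]


def count_differences(a, b):
    """Number of loci at which two genotypes carry different alleles."""
    return sum(x != y for x, y in zip(a, b))


def estimate_overlap(n_loci, p_one=0.7, p_two=0.3, trials=1000, seed=0):
    """Fraction of trials in which a pair from Population I is more
    dissimilar than a pair made of one individual from each population."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        within_a = sample_haploid(p_one, n_loci, rng)
        within_b = sample_haploid(p_one, n_loci, rng)
        between_a = sample_haploid(p_one, n_loci, rng)
        between_b = sample_haploid(p_two, n_loci, rng)
        if count_differences(within_a, within_b) > count_differences(between_a, between_b):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for n in (10, 100, 300):
        print(n, estimate_overlap(n))
```

In this simplified setting the estimated fraction falls towards zero as loci are added. Real populations differ far less at most loci than these hypothetical frequencies suggest, which is consistent with the much larger numbers of markers Witherspoon et al. report needing.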

Clustering does not particularly correspond to continental divisions. Depending on the parameters given to their analytical program, Rosenberg and Pritchard were able to construct divisions of between 4 and 20 clusters of the genomes studied, although they excluded analyses with more than 6 clusters from their published article. Probability values for various cluster configurations varied widely, with the single most likely configuration having 16 clusters, although other 16-cluster configurations had low probabilities. Overall, "there is no clear evidence that K=6 was the best estimate" according to geneticist Deborah Bolnick (2008:76-77).^[12]

A 2009 study by the HUGO Pan-Asian SNP Consortium, using the similar method of principal components analysis, found that East Asian and South-East Asian populations clustered together, and suggested a common origin for these populations. At the same time they observed a broad discontinuity between this cluster and South Asia, commenting "most of the Indian populations showed evidence of shared ancestry with European populations". It was noted that "genetic ancestry is strongly correlated with linguistic affiliations as well as geography".

References

^ Population Structure and Eigenanalysis, Nick Patterson, Alkes L. Price, David Reich. PLoS Genet 2(12): e190. doi:10.1371/journal.pgen.0020190
^ Saitou. Kyushu Museum. 2002. February 2, 2007
^ ^a ^b Lynn B Jorde & Stephen P Wooding, 2004, "Genetic variation, classification and 'race'" in Nature Genetics 36, S28–S33
^ "Genetic Similarities Within and Between Human Populations" (2007) by D.J. Witherspoon, S. Wooding, A.R. Rogers, E.E. Marchani, W.S. Watkins, M.A. Batzer and L.B. Jorde. Genetics. 176(1): 351–359.
^ Waples, R. S. and Gaggiotti, O. What is a population? An empirical evaluation of some genetic methods for identifying the number of gene pools and their degree of connectivity. Molecular Ecology (2006) 15: 1419–1439. doi:10.1111/j.1365-294X.2006.02890.x
^ ^a ^b Genetic Similarities Within and Between Human Populations by D. J. Witherspoon, S. Wooding, A. R. Rogers, E. E. Marchani, W. S. Watkins, M. A. Batzer, and L. B. Jorde Genetics. 2007 May; 176(1): 351–359.
^ Sarich VM, Miele F. Race: The Reality of Human Differences. Westview Press (2004). ISBN 0-8133-4086-1
^ Back with a Vengeance: the Reemergence of a Biological Conceptualization of Race in Research on Race/Ethnic Disparities in Health Reanne Frank
^ Rosenberg NA, Mahajan S, Ramachandran S, Zhao C, Pritchard JK, et al. (2005) Clines, Clusters, and the Effect of Study Design on the Inference of Human Population Structure. PLoS Genet 1(6): e70 doi:10.1371/journal.pgen.0010070
^ Bamshad, Wooding, Salisbury and Stephens (2004) Deconstructing the relationship between genetics and race. Nature Reviews Genetics 5(8):598–609. doi:10.1038/nrg1401
^ The table gives the percentage likelihood that two individuals from different clusters are genetically more similar to each other than to someone from their own population when 377 microsatellite markers are considered from Bamshad et al. (2004)doi:10.1038/nrg1401, original data from Rosenberg (2002).
^ Bolnick, Deborah A. (2008). "Individual Ancestry Inference and the Reification of Race as a Biological Phenomenon". In Koenig, Barbara A.; Richardson, Sarah S.; Lee, Sandra Soo-Jin. Revisiting race in a genomic age. Rutgers University Press. ISBN 9780813543246.
